PART1. QUESTIONS AND GRAPHS

III. Questions about directors and actors.

a) Who is the most valuable director?
b) Who is the most valuable actor?

We intend to find who is the most valuable director and who is the most valuable actor. To measure "valuable" on a quantitative scale, we use two variables: IMDB score and total profit. We selected directors and actors with an IMDB score higher than 6.0 and put the score on the x axis and total profit on the y axis. The two plots showed us very intriguing results. We observed that a few directors and actors have very high IMDB scores but earned low profit. They are indicated by blue dots. Nevertheless, a considerable number of directors and actors have both a high score and high profit. We labeled them with red dots. We could spot some famous names among the directors, such as James Cameron, Christopher Nolan and Steven Spielberg.

We could also observe popular actors and actresses in the plot, such as Scarlett Johansson, Leonardo DiCaprio and Brad Pitt. Thus, using IMDB score and total profit gives us an intuitive result about the profitability of directors and actors. Although further analysis needs to be conducted to obtain more detailed and reliable results, these two plots give us a preliminary result about the most valuable director and actor.
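The aggregation behind such a plot can be sketched as follows. This is only an illustration under stated assumptions: profit is taken to be gross minus budget, the 6.0 cutoff is applied to each person's mean score, and the 100e6 color threshold, the function names and the sample rows are all hypothetical rather than taken from the analysis.

```python
from collections import defaultdict

# Hypothetical rows: (director_name, imdb_score, gross, budget). Values are made up.
movies = [
    ("Director A", 8.1, 500e6, 200e6),
    ("Director A", 7.5, 300e6, 150e6),
    ("Director B", 8.4, 20e6, 15e6),
    ("Director C", 5.2, 90e6, 30e6),
]

def summarize(rows, min_score=6.0):
    """Return {name: (mean IMDB score, total profit)} for names whose mean score exceeds min_score."""
    scores = defaultdict(list)
    profit = defaultdict(float)
    for name, score, gross, budget in rows:
        scores[name].append(score)
        profit[name] += gross - budget  # assumption: profit = gross - budget
    result = {}
    for name, values in scores.items():
        mean_score = sum(values) / len(values)
        if mean_score > min_score:
            result[name] = (mean_score, profit[name])
    return result

def color(total_profit, threshold=100e6):
    """Red for high total profit, blue otherwise. The threshold is an assumption."""
    return "red" if total_profit >= threshold else "blue"

summary = summarize(movies)
labels = {name: color(total) for name, (_, total) in summary.items()}
print(labels)
```

Each entry of `summary` corresponds to one point in the scatter plot (x = mean score, y = total profit), and `labels` gives its color.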



PART2. TABLES


I. Table that summarizes the whole data

Character variables (Length 4846, Class character, Mode character):
color, director_name, actor_2_name, genres, actor_1_name, movie_title, actor_3_name, plot_keywords, movie_imdb_link, language, country, content_rating

Numeric variables:
Variable                   Min.       1st Qu.    Median     Mean       3rd Qu.    Max.       NA's
num_critic_for_reviews     1.0        50.0       110.0      139.6      193.0      813.0      46
duration                   7.0        94.0       103.0      107.9      118.0      511.0      14
director_facebook_likes    0          7          48         692        190        23000      39
actor_3_facebook_likes     0.0        130.5      365.0      632.0      635.0      23000.0    23
actor_1_facebook_likes     0.0        607.5      984.0      6566.9     11000.0    640000.0   7
gross                      0.000e+00  4.197e+06  2.769e+07  8.623e+07  9.396e+07  2.784e+09  158
num_voted_users            5          8351       33520      83232      94279      1689764    0
cast_total_facebook_likes  0          1394       3075       9665       13740      656730     0
facenumber_in_poster       0.000      0.000      1.000      1.371      2.000      43.000     13
num_user_for_reviews       1.0        64.0       155.0      269.6      322.8      5060.0     20
budget                     1.100e+03  5.000e+06  1.800e+07  3.299e+07  4.000e+07  4.200e+09  96
title_year                 1916       1999       2005       2002       2011       2016       43
actor_2_facebook_likes     0          277        593        1635       912        137000     13
imdb_score                 1.600      5.800      6.600      6.422      7.200      9.500      0
aspect_ratio               1.180      1.850      2.350      2.152      2.350      16.000     322
movie_facebook_likes       0          0          158        7342       2000       349000     0





II. Table that summarizes the genre data

Genres Number of Movies Weights in All Genres Mean Gross in 2001 Mean Gross in 2002 Mean Gross in 2003 Mean Gross in 2004 Mean Gross in 2005 Mean Gross in 2006 Mean Gross in 2007 Mean Gross in 2008 Mean Gross in 2009 Mean Gross in 2010 Mean Gross in 2011 Mean Gross in 2012 Mean Gross in 2013 Mean Gross in 2014 Mean Gross in 2015 Mean Gross in 2016 Mean Gross Mean Profit
Action 1099 7.91 7.5113359 5.0348338 3.6659315 6.1610049 5.3109100 9.2933920 9.3247616 6.9523944 11.1307750 9.9881282 8.6021768 13.8992884 10.7120841 9.0187928 15.4251190 9.5714290 5.0801528 5.7153813
Adventure 878 6.32 5.5362595 6.6593737 7.0536493 6.9766441 7.3259330 7.3137728 11.9040353 10.1560429 9.7798759 13.0022783 13.1423075 14.5044585 12.0745704 8.7810398 21.3889007 14.1425232 4.8394623 6.3247249
Animation 233 1.68 1.2743054 2.6344065 1.4076387 2.6671145 3.4189402 1.9859662 3.7050303 4.1424675 3.9280398 3.7226985 4.9816557 3.4491392 4.3410987 1.8819536 24.6721281 16.3805887 1.4992517 1.1446435
Biography 290 2.09 0.2482900 0.8268890 0.5006664 0.6655225 0.7232949 0.2832208 0.9184040 0.6639806 0.4592782 0.5509327 1.6751161 1.5527428 1.2447248 0.1715944 5.6671196 3.2204343 0.5458465 0.4278311
Comedy 1817 13.08 6.3135925 6.2782505 5.5654253 7.0248587 6.6313673 6.5561736 7.4834405 8.0553552 8.8167848 8.5377880 8.3677087 7.2378317 7.2213983 5.4196184 8.1840969 5.2999010 5.2076130 5.2855890
Crime 850 6.12 2.2275215 1.7363677 1.8420637 2.0705889 1.9190191 2.5618709 1.8452078 1.3744541 2.8351667 1.9909499 2.9746719 2.3618794 3.0741678 0.9705592 6.1597263 2.7949582 2.7619023 2.3630079
Documentary 121 0.87 0.0303067 0.2581488 0.1063518 0.1308012 0.0655831 0.0624536 0.1779249 0.1967070 0.1319757 0.0707202 0.0671857 0.0001423 0.0000855 0.0000000 1.3887833 0.9644173 0.0428024 0.1569546
Drama 2484 17.88 5.1757259 4.9022803 4.8857020 6.1129869 4.1574639 8.0350153 5.8787095 5.4907725 4.9054520 6.9132858 6.0174162 8.2031323 7.0748358 2.7122157 5.7746380 3.2301933 4.6359725 6.1435429
Family 522 3.76 2.7756047 4.6443551 4.6498030 4.3013900 5.0755897 4.7462867 5.7610736 7.2129003 5.3168796 5.1230471 5.6830006 5.0151444 5.3073983 3.2243129 18.1183430 12.3750335 2.4679865 3.3040821
Fantasy 571 4.11 3.4203568 3.9765935 3.1728971 4.9017013 5.6133112 3.9915152 7.4471926 8.5155584 5.7441466 7.2807121 7.9931934 6.4181920 3.8406592 2.3744208 17.9812686 11.5767160 4.2649923 4.2234106
Film-Noir 6 0.04 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.5300000 0.3842400 0.0000000 0.0000000
Game-Show 1 0.01 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 3.6882378 1.5882378 0.0000000 0.0000000
History 200 1.44 1.0040606 0.3891286 0.4285096 0.5665936 0.3663727 0.6388674 0.4077878 0.5642908 0.1483307 0.6916353 0.3267927 0.8188780 1.0108431 0.2102009 6.5065776 2.9859592 0.7063565 0.2647965
Horror 535 3.85 0.8555357 1.2240064 1.4101534 0.9690721 1.3558102 1.2304313 0.9386224 1.2128496 0.9710811 1.6731958 1.8966377 1.0272911 0.6957171 0.7508504 4.9354089 3.1503099 0.4358286 0.5607328
Music 209 1.50 0.4393631 0.3409285 0.4099673 0.7991010 0.9363650 0.4176639 0.7821179 0.4671639 0.4715517 0.3729657 0.2861659 0.2257222 0.8693450 0.0443235 5.2228877 3.4092869 0.2268857 0.3840162
Musical 129 0.93 0.2010248 0.2202015 0.1050264 0.1684012 0.5726231 0.9350985 0.4115464 0.6981324 0.7628084 0.2108741 1.3509458 0.9818237 0.0000000 0.0000000 9.7048240 6.8209531 0.0644281 0.3525323
Mystery 469 3.38 0.8753704 2.9847455 2.6060465 2.0650152 2.4663044 1.1278823 3.0337066 1.0664908 2.4473679 1.1820572 1.3231570 1.9764144 0.9639744 0.9221128 8.2563691 5.0673599 0.7971685 2.3881900
News 3 0.02 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0191215 0.0000000 0.0000000 0.0000000 0.0007409 0.0000000 0.0000000 0.0000000 0.6620821 -0.0983512 0.0000000 0.0000000
Reality-TV 1 0.01 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 3.6882378 1.5882378 0.0000000 0.0000000
Romance 1069 7.69 2.2566642 5.0416630 3.4418709 2.9037801 3.2572954 4.9505569 3.7935779 4.9905684 2.3822034 2.5683145 1.5257596 2.5658222 1.7679143 1.5471284 7.0302851 4.4717040 2.9332282 3.0518185
Sci-Fi 580 4.17 2.7900955 1.9852999 2.2179700 1.4032017 1.5226040 2.7674792 8.3710624 2.5597188 4.2696645 5.6192155 8.4344678 9.6668194 8.2730364 5.7962020 15.8368669 10.1614026 1.8724605 2.4054973
Short 5 0.04 0.0000000 0.0000000 0.0000000 0.0000000 0.0059187 0.0075189 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.2786963 -0.3123037 0.0000000 0.0000000
Sport 176 1.27 0.2174055 0.5419665 0.7743276 1.1244454 0.4909349 0.3290597 0.4590991 0.2120606 1.0067106 0.1316734 0.5553004 0.2622473 0.7115048 0.0700666 6.0399776 3.1415766 0.2654963 0.4002580
Thriller 1348 9.70 2.9990741 4.3714872 4.1243118 5.1675688 3.5709262 4.7007420 3.9046643 4.6536180 4.9463396 7.0516381 6.7165368 6.2413112 8.1795891 2.3495328 8.4273057 4.9689390 4.3291819 4.5934421
War 207 1.49 0.9485443 0.6133800 0.4057599 0.6758511 0.3926974 0.7265270 0.3861284 0.2747973 0.1624909 0.4144702 0.2499750 1.8874680 0.0444224 0.1259870 6.9520090 3.5084852 0.8276862 0.4618998
Western 93 0.67 0.0686140 0.2084806 0.1538371 0.0195353 0.0878110 0.0280399 0.0000000 0.2741451 0.2481137 0.4583953 0.2600021 0.0460585 0.6786576 0.0029766 5.8072120 2.8088210 0.0142245 0.1065153







PART3. ANALYSIS

I. Number of movies VS. years

We first draw a plot to explore the relationship.

We find that there may be a linear relationship between log(year) and the number of movies, so we run a regression analysis.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6015224 0.1066876 -5.638164 2e-07
x 0.0676408 0.0020141 33.584483 0e+00
The slope is strongly significant (t = 33.58, p-value reported as 0), and R-squared = 92.6%.
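For readers who want to see where the columns of such a regression table come from, here is a minimal sketch of simple least squares in plain Python. The function name `simple_ols` and the toy data are hypothetical; only the last line uses numbers from the table above, to check that the t value is the estimate divided by its standard error.

```python
import math

def simple_ols(x, y):
    """Fit y = a + b*x by least squares and return estimates, standard errors, t values and R-squared."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    sigma2 = sse / (n - 2)             # residual variance, n - 2 degrees of freedom
    se_b = math.sqrt(sigma2 / sxx)
    se_a = math.sqrt(sigma2 * (1 / n + mx ** 2 / sxx))
    return {
        "intercept": a, "slope": b,
        "se_intercept": se_a, "se_slope": se_b,
        "t_intercept": a / se_a, "t_slope": b / se_b,
        "r_squared": 1 - sse / sst,
    }

# Hypothetical toy data, not the movie counts.
x = [1, 2, 3, 4, 5]
y = [0.9, 2.1, 2.9, 4.2, 4.9]
fit = simple_ols(x, y)
print(fit)

# Consistency check on the reported table: t value = Estimate / Std. Error.
reported_t = 0.0676408 / 0.0020141
print(round(reported_t, 2))
```

The ratio in the last line agrees with the reported t value of 33.58 up to rounding of the printed standard error.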




II. Analyse the relation between year and number of critic reviews.

First of all, pick all the numeric variables and remove all the rows containing missing values.

We would like to know the relation between year and the number of critic reviews. From the data, we know that the sample sizes for the period before 1980 are too small, and there is too much noise in the data for the period before 1981. We therefore only consider the data for the period after 1981. In addition, we delete some sample points that are too extreme.

We calculate the evolution of the mean of num_critic_for_reviews with time.
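A minimal sketch of this yearly averaging is shown below. The function name `yearly_mean`, the sample rows and the choice to treat 1981 as the first included year are assumptions made for illustration. Rows with a missing year or value are skipped, mirroring the removal of rows with missing values.

```python
from collections import defaultdict

def yearly_mean(rows, first_year=1981):
    """Mean value per year, skipping missing entries and years before first_year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, value in rows:
        if year is None or value is None or year < first_year:
            continue
        totals[year] += value
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Hypothetical (title_year, num_critic_for_reviews) pairs. None marks a missing value.
rows = [(1975, 40), (1995, 50), (1995, 70), (2005, None), (2005, 120), (None, 10)]
means = yearly_mean(rows)
print(means)
```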

From 2014 onward, num_critic_for_reviews decreases, which we attribute to the decline of the film industry. This phenomenon is not natural, so we only consider the evolution between 1990 and 2013. This evolution can be fitted by a quadratic function:

\[f(x)=0.85(x-1995)^2+52\]
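
As a quick worked check of what this function implies, evaluating it at the two ends of the fitted period and at its minimum gives:

\[f(1990)=0.85\cdot(-5)^2+52=73.25,\qquad f(1995)=52,\qquad f(2013)=0.85\cdot 18^2+52=327.4\]

So the fitted mean falls from about 73 in 1990 to 52 in 1995 and then rises to about 327 by 2013.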



From the above graph, we can see a decreasing trend before 1994 and an increasing trend after 1994. We use the data after 1994 to explore the linear relation between the number of critic reviews and title year, to see whether our quadratic function reflects the true relation.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -187.3435185 11.2706393 -16.62226 0
year_critic 0.0958022 0.0056241 17.03437 0

During the period 1995-2013, the log of num_critic_for_reviews increased linearly with time (year). The relationship can be expressed as:

\[y=-187.34352+0.09580x\]
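
To read the slope in more familiar terms, assume the logarithm is the natural log (the text does not say which base is used). With the slope estimate from the regression table, 0.0958022, one extra year multiplies the expected number of critic reviews by

\[e^{0.0958022}\approx 1.1005\]

that is, growth of roughly 10% per year over this period.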




III. Analyse the evolution of movie_facebook_likes with time.

We draw a graph showing the evolution of movie_facebook_likes during the period 1986-2016.

From 2014 onward, movie_facebook_likes declines, which we attribute to the decline of the film industry. This phenomenon is not natural, so we only consider the evolution between 1993 and 2013. This evolution can be fitted by a quadratic function:

\[g(x)=200(x-2000)^2+1600\]

From the above graph, we can see a decreasing trend before 2000 and an increasing trend after 2000. We use the data after 2000 to explore the linear relation between movie_facebook_likes and title year, to see whether our quadratic function reflects the true relation.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -409.1178990 37.1779187 -11.00432 0
year_fblike 0.2081514 0.0185148 11.24241 0

During the period 2001-2013, the log of movie_facebook_likes increased linearly with time (year). The relationship can be expressed as:

\[y=-409.11790+0.20815x\]




IV. Distance analysis for classification

Finally, we tried to classify the data using distance analysis. We want to classify the movies into two groups: those with an IMDB score higher than 8.0 and those with an IMDB score lower than 8.0.

The first step is to find the variables that are significantly related to IMDB score. To do so, we first regress IMDB score on each candidate variable separately.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5596374 0.0550183 119.226496 0.0000000
aspect_ratio -0.0511509 0.0248119 -2.061546 0.0393106

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3908696 0.0187618 340.631858 0
cast_total_facebook_likes 0.0000059 0.0000009 6.532097 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5021751 0.0197386 329.413639 0.0e+00
facenumber_in_poster -0.0372243 0.0080937 -4.599151 4.4e-06

Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.9972764 0.0248047 241.77964 0
num_critic_for_reviews 0.0030096 0.0001286 23.39832 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5364386 0.0756187 59.99097 0
duration 0.0175496 0.0006788 25.85537 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3342732 0.0170188 372.19182 0
movie_facebook_likes 0.0000146 0.0000008 18.43094 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4004486 0.0166237 385.02028 0
director_facebook_likes 0.0000678 0.0000055 12.43521 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1424274 0.0173876 353.2649 0
num_voted_users 0.0000034 0.0000001 33.0110 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4069868 0.0180362 355.228330 0
actor_1_facebook_likes 0.0000064 0.0000011 5.801156 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4087059 0.0177044 361.983859 0
actor_2_facebook_likes 0.0000241 0.0000039 6.224397 0

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4251565 0.0176029 365.005857 0.00e+00
actor_3_facebook_likes 0.0000386 0.0000096 4.043426 5.36e-05

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.431449 0.0177846 361.62957 0.0000000
budget 0.000000 0.0000000 2.87624 0.0040441

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.337511 0.0183946 344.53107 0
gross 0.000000 0.0000000 12.78592 0

We define a variable as significant when the p-value of its regression coefficient is smaller than 0.001.
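The threshold rule can be sketched as a simple filter. The p-values below are copied from the regression tables as printed (a printed 0 is entered as 0.0, meaning it was rounded to zero). The names `p_values`, `ALPHA`, `significant` and `excluded` are only illustrative.

```python
# p-values as printed in the single-variable regressions of imdb_score.
p_values = {
    "aspect_ratio": 0.0393106,
    "cast_total_facebook_likes": 0.0,
    "facenumber_in_poster": 4.4e-06,
    "num_critic_for_reviews": 0.0,
    "duration": 0.0,
    "movie_facebook_likes": 0.0,
    "director_facebook_likes": 0.0,
    "num_voted_users": 0.0,
    "actor_1_facebook_likes": 0.0,
    "actor_2_facebook_likes": 0.0,
    "actor_3_facebook_likes": 5.36e-05,
    "budget": 0.0040441,
    "gross": 0.0,
}

ALPHA = 0.001  # significance threshold stated in the text

significant = [name for name, p in p_values.items() if p < ALPHA]
excluded = [name for name in p_values if name not in significant]

print(len(significant), "significant variables")
print("excluded:", excluded)
```

With the printed p-values, aspect_ratio and budget fall above the threshold, and the other eleven variables pass it.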

The second step is to run the distance analysis using the significant variables. We use two methods to analyze the data: one uses Euclidean distance for both inputs, and the other uses Euclidean distance and Mahalanobis distance as inputs. We define the training set as half of the movies in each group, and the other half is used as the testing set. The results are as follows.

## [1] "When using both euclidean distance, the successful classification probability is 0.9307"
## [1] "When using euclidean distance and mahalanobis distance, the successful classification probability is 0.9037"

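The results show that our method has high prediction accuracy: about 93% when using Euclidean distance only, and about 90% when using Euclidean and Mahalanobis distance.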
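The text does not spell out how the distances are turned into a decision, so the following is only a sketch of one common distance-based rule, not a reconstruction of the procedure used here: assign a movie to the group whose training centroid is nearest, measured with either Euclidean or Mahalanobis distance. It uses two features so the covariance inverse can be written in closed form. All function names and data values are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def covariance_2d(points):
    """Sample covariance of two-feature points as (sxx, sxy, syy)."""
    n = len(points)
    mx, my = centroid(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def euclidean(p, stats):
    mean, _ = stats
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, mean)))

def mahalanobis(p, stats):
    mean, (sxx, sxy, syy) = stats
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # d^2 = v^T * inverse(cov) * v, using the closed-form inverse of a 2x2 matrix
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

def classify(p, group_stats, distance):
    """Return the label of the group with the smallest distance to p."""
    return min(group_stats, key=lambda label: distance(p, group_stats[label]))

def accuracy(test_set, group_stats, distance):
    hits = sum(classify(p, group_stats, distance) == label for p, label in test_set)
    return hits / len(test_set)

# Hypothetical training halves: two made-up features per movie.
train = {
    "high": [(150, 300), (160, 320), (170, 310), (155, 335)],
    "low": [(90, 100), (100, 120), (95, 90), (110, 115)],
}
# Hypothetical testing half with known labels.
test_set = [((158, 315), "high"), ((98, 105), "low")]

group_stats = {label: (centroid(pts), covariance_2d(pts)) for label, pts in train.items()}
acc_euclidean = accuracy(test_set, group_stats, euclidean)
acc_mahalanobis = accuracy(test_set, group_stats, mahalanobis)
print(acc_euclidean, acc_mahalanobis)
```

Euclidean distance treats every feature on the same scale, while Mahalanobis distance rescales by each group's covariance, which is one reason the two variants can give different accuracies on real data.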